Abstract

Recent technological advances have made lightweight, head-mounted cameras both practical and affordable, and products like Google Glass represent first attempts to introduce the idea of egocentric (first-person) video to the mainstream. Interestingly, the computer vision community has only recently started to explore this new domain of egocentric vision, where research can roughly be categorized into three areas: object recognition, activity detection/recognition, and video summarization. In this paper, we give a broad overview of the different problems that have been addressed, and we collect and compare evaluation results. Moreover, the emergence of this new domain has brought with it numerous new and versatile benchmark datasets, which we summarize and compare as well.

1 Introduction 

Most of the classic work in computer vision has been devoted to studying either static images or video from stationary cameras (such as tracking objects in surveillance applications). Recently, technological advances have made lightweight, wearable, egocentric cameras both practical and popular in various fields. The GoPro camera, for instance, can be mounted on helmets and is popular in many sports such as biking, surfing, or skiing. The Microsoft SenseCam can be worn around the neck and has enough video storage to capture an entire day for the purpose of “life logging”. Cognitive scientists use first-person cameras attached to glasses (often in combination with eye trackers such as Tobii or SMI) to study visual attention in naturalistic environments. Most recently, emerging products like Google Glass have started making first attempts to bring the idea of wearable, egocentric cameras into the mainstream.

From a computer vision standpoint, videos from these first-person devices pose many challenges. Because the camera is constantly moving, the motion is highly non-linear and unpredictable. As a result, objects may rapidly disappear from and reappear in the field of view. In extreme cases (such as sport videos), one must also expect effects like motion blur, splashing water, or glare. On the other hand, some qualities of egocentric video may be helpful for specific applications. For example, objects that the observer manipulates, or people and faces that the observer interacts with, tend to be naturally centered in the view and are less likely to be occluded than they might be if captured from a static, third-person camera.

In the next section, we introduce the most recent work from the computer vision community in the domain of egocentric video. We point out egocentric-specific challenges that occurred within the given problems, but also mention situations where the egocentric paradigm was actually useful. We emphasize that egocentric video is an emerging field and much of the work that we reference can be considered pioneering. As a result, few approaches build on top of each other, and direct quantitative comparisons between different works are often difficult.

Another effect of the novelty of the egocentric domain is the emergence of numerous new and very versatile datasets. While we briefly explain the individual datasets along with the work in section 2, we give a detailed overview of publicly available datasets in section 3.

In section 4 we summarize and compare results from the previous sections, and finally section 5 concludes the paper.

2 Recent Work 

In this section, we introduce recent work in the field of egocentric video. We group this work into three categories. The first category deals with recognizing objects that are being manipulated (by hand) by the first-person observer. The second category deals with the detection and recognition of first-person actions and activities. We will see that this category naturally emerges from the first one, as most of the considered activities are characterized by the objects being used. The third category deals with so-called “life logging” video data. This data is mainly characterized by the fact that it consists of hours of continuous video depicting the “life” of the first-person observer. Work in this area usually deals with data summarization, i.e. the extraction of relevant or representative frames or actions. However, there is also work on more specific tasks, such as the detection of social interactions based on egocentric video recorded by a group of people in a theme park.

2.1 Object Recognition 

One of the first analyses of object recognition in egocentric video was done by Ren and Philipose [1]. Motivated by the idea that recognizing handled objects can provide essential information about a person’s activity, they wanted to explore the challenges and characteristics of object recognition in the context of egocentric video. They collected a video dataset consisting of 42 everyday objects (milk carton, watering can, etc.), where each object was being manipulated by hands in an object-specific way. To obtain baseline results for their dataset, they annotated a small subset of frames with ground-truth object versus background segmentations. They used a standard SIFT-based recognition system described in [2] and trained a multi-class SVM. They achieved a 12% recognition rate compared to a random chance of 2.4% (1 in 42). They went on to quantify the influence of various egocentric-specific challenges, such as limited texture of objects, background clutter, and hand occlusion. To obtain an upper bound for recognition performance, they used the SIFT recognition system on clean exemplar images of their objects, obtaining an average accuracy of 63.7%. Simulating occlusion on the clean exemplars reduced the accuracy to 57.0%, simulating background clutter resulted in a 20% drop in accuracy, and combining both reduced the accuracy to 30.3%. They suggest motion and location priors as well as hand detection as future research directions.

Follow-up work was done by Ren and Gu [3], who developed a motion-based approach to segment out foreground objects in egocentric video in order to improve object recognition accuracy. The idea is based on the observation that egocentric video exhibits regularities in motion that are useful for motion segmentation: during object manipulation, hands and objects tend to appear near the center of the view, and body (i.e. camera) motions are rather small and horizontal. Their model explicitly addresses this with a motion prior and a location prior for each pixel. The distribution for the location prior is built by averaging ground-truth segmentation masks, and the motion prior is based on optical flow results obtained from video parts that only contain background (no hands or objects), thus giving an average flow estimate for each background pixel. Additionally, they used temporal cues that take segmentation masks from previous frames into account. Finally, they used the coarse-to-fine variational optical flow algorithm of [4] to create dense optical flow across two frames and then used RANSAC to fit the motion vectors into affine layers. Equipped with these motion features and priors, they trained a max-margin classifier for pixelwise figure-ground classification and cleaned up the results using the standard Graph Cut algorithm. For testing, they used the same 42-object dataset as [1] and improved the accuracy of the SIFT-based recognition system from 12% to 20%. They also tested a latent HOG-based recognition system [5] and found that the accuracy improved from 38% to 46%.

Fathi et al. [6] took advantage of the egocentric paradigm (objects of interest tend to be centered and at a large scale) to learn object classification and segmentation with very weak supervision. The motivating idea of using object recognition as a way to infer possible activities is similar to that of [1], but is taken a step further in the sense that they explored egocentric activities involving multiple objects (such as making a peanut butter and jelly sandwich). They hypothesized that the co-occurrence of different objects within those activities can be exploited for object detection and localization. They performed figure-ground segmentation as well, but their approach differed from [3] in that it allowed objects to become part of the background after being manipulated. This is accomplished by splitting the video into short intervals and creating a local background model for each. For the weakly supervised learning, they collected a dataset of 7 daily activities involving multiple objects (making coffee/tea/sandwiches). Each video was only labeled with the list of objects it contained. To learn an appearance model for each object type, they used the diverse-density-based multiple instance learning framework of [7]. They further used equality constraints to assign the same label to regions with significant temporal connections. The object recognition accuracy ranged from about 10% (sugar) to about 95% (coffee). Additionally, their figure-background segmentation approach outperformed [3] on the 42-object dataset, with a 48% segmentation error rate as opposed to 67%.

2.2 Activity and Action Detection 

Many authors recognized that many activities that are interesting from an egocentric perspective are characterized by the observer manipulating objects in front of them. This is very different from third-person video, where objects might be hard to see and research has therefore focused on activities that can be modeled by different body movements (e.g. dancing). In this section, we use the terminology that has been established in recent work on egocentric activity and action detection: an action describes a simple, straightforward step such as “take the knife” or “open the fridge”, while an activity describes a more complex aggregation of actions, such as making coffee.

2.2.1 Early Work Using Gist 

Early work in the domain of both unsupervised action segmentation and supervised action classification was done by Spriggs et al. [8]. They introduced the “CMU kitchen” dataset, which contains multimodal measurements, including egocentric video, of people cooking different recipes (brownies, pizza, etc.) in a kitchen environment. Each frame was labeled with an action class (such as “stirring”). For action segmentation, rather than trying to recognize objects like most of the follow-up work, they computed the gist [9] of each frame. The assumption is that, under the egocentric paradigm, specific actions are performed in front of a somewhat constant background, making a gist feature vector a reasonable way to model each frame. They performed PCA to reduce the vector dimensionality and estimated different Gaussian mixture models to investigate whether these features cluster into similar scenes. For some activities, such as “stirring”, they saw promising results (70% of frames labeled with this action were assigned to the right cluster) but noted that the results do not generalize well, as model parameters need to be varied to capture distinct sets of actions. They also explored supervised action classification by training an HMM with a mixture-of-Gaussians output on the gist features and obtained an average classification accuracy of 9.38% (chance being 3%). Lastly, they applied a simple KNN model, where each test frame from one subject is given the label of the frame with the smallest Euclidean distance from the set of frames of all other subjects, reaching a classification accuracy of 48.64%.
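
The nearest-neighbor evaluation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: each frame is assumed to be already reduced to a small feature vector (standing in for a PCA-reduced gist descriptor), and the function names and toy data are hypothetical.

```python
import math

def nearest_neighbor_label(test_vec, train_items):
    """Return the label of the training vector closest to test_vec (Euclidean)."""
    best_label, best_dist = None, math.inf
    for vec, label in train_items:
        dist = math.dist(test_vec, vec)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_subject_out_accuracy(frames):
    """frames: list of (subject_id, feature_vector, action_label) tuples.

    Each frame is classified using only frames recorded by OTHER subjects.
    """
    correct = 0
    for subject, vec, label in frames:
        others = [(v, l) for s, v, l in frames if s != subject]
        if nearest_neighbor_label(vec, others) == label:
            correct += 1
    return correct / len(frames)

# Hypothetical toy data: two subjects, two actions, 2-D features.
toy_frames = [
    ("s1", (0.0, 0.1), "stir"), ("s1", (1.0, 0.9), "pour"),
    ("s2", (0.1, 0.0), "stir"), ("s2", (0.9, 1.0), "pour"),
]
toy_accuracy = leave_one_subject_out_accuracy(toy_frames)
```

Excluding the test subject's own frames matters here: neighboring frames from the same recording are nearly identical, so including them would make the task trivially easy.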

2.2.2 Object-based Activity Detection 

Further research on activity detection was done by Pirsiavash and Ramanan [10], whose work stands out due to their large, versatile, and fully labeled dataset. They captured 18 daily indoor activities such as brushing teeth, washing dishes, or watching television, each performed by 20 different subjects in their respective apartments. The 42 different object classes involved in these activities were annotated with bounding boxes. Each object also had a label indicating whether it is currently active (in hands) or not. Also driven by the idea that activities are all about the objects involved, they used their data to build an activity model that explicitly models object use over time. For every frame of a given activity, they used the part-based object model of [11] to record a score based on the most likely position and scale for each of their 42 object classes. Averaging this score over all activity frames yielded a histogram of object scores for a specific activity. They went on to temporally split the video into halves in a pyramid fashion, each time calculating the object score histogram, thus ending up with an activity model that describes object use over time. They learned a linear SVM on these models. Trained with all objects, they achieved a 32.6% activity classification accuracy (chance being 5.6%), and trained with only active objects they achieved 40.6% accuracy.
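
The temporal pyramid of object score histograms can be illustrated with a short sketch. It assumes that per-frame detector scores are already available as plain lists (one score per object class) and that the video has at least as many frames as the finest pyramid level has segments; the function names and the two-object toy data are hypothetical.

```python
def mean_scores(frame_scores):
    """Average the per-object detector scores over a list of frames."""
    n_objects = len(frame_scores[0])
    return [
        sum(frame[k] for frame in frame_scores) / len(frame_scores)
        for k in range(n_objects)
    ]

def temporal_pyramid(frame_scores, levels=2):
    """Concatenate mean score histograms over 1, 2, 4, ... equal segments.

    Level 0 covers the whole video, level 1 its two halves, and so on.
    """
    feature = []
    n = len(frame_scores)
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            start, end = p * n // parts, (p + 1) * n // parts
            feature.extend(mean_scores(frame_scores[start:end]))
    return feature

# Hypothetical toy video: object 0 is used first, object 1 afterwards.
toy_scores = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_feature = temporal_pyramid(toy_scores, levels=2)
```

In the toy example, the whole-video histogram cannot tell which object was used first, while the two half-video histograms can, which is the point of adding the finer levels.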

An alternative, unsupervised activity model was proposed by Fathi et al. [12]. Continuing their own work on object recognition in egocentric video [6], they proposed a graph-based model that takes advantage of the semantic relationship between activities, actions, and objects. They worked on the same dataset as in [6], which contains activities such as making various kinds of sandwiches. Based on detected objects, object-hand interactions, and a set of action labels (“spread butter on bread”, etc.), they used an approach similar to Expectation-Conditional Maximization [13] to learn actions and then learn activities from actions. Then, the inferred activity label was fixed and used to enhance action recognition results, as the activity can limit the set of possible actions as well as enforce a certain order. Finally, they enhanced their initial object recognition by learning a probabilistic object model that incorporates the inferred action priors. They recognized 6 out of 7 activities correctly, and their action recognition accuracy was 32.4% (chance being 1.6%). They also showed that this framework indeed improved their initial object recognition performance, achieving better results for almost all object classes.

Fathi et al. extended their work in [14] by additionally considering eye gaze, using calibrated, head-mounted eye trackers in combination with egocentric cameras. They raised the question of whether knowing the fixation locations helps to better recognize actions and vice versa. This approach is motivated by psychological studies [15] which demonstrate that during object manipulation tasks a substantial percentage of gaze fixations fall upon task-relevant objects. They used a generative model to describe the relationship between egocentric action and gaze location. This means they learned the probability of transitioning to a gaze location g_t, given g_{t-1} and the current action a, as well as the likelihood of an image feature x_t, given the current action a and the gaze position g_t. The image features were based on object features described in their earlier work [12], as well as appearance features and future manipulation features. The appearance features were used to describe the fixated part of an object and were based on color and texture histograms in a circular area around the gaze location. Future manipulation features were designed to take advantage of the fact that gaze is usually a split second ahead of the hands, so knowing the hand location a few frames ahead provides a cue to the gaze location in the current frame. They used a new dataset involving different kinds of meal preparation, similar to their previous work but extended by the gaze data. They found that incorporating gaze information improved the action recognition accuracy to 47%, compared to 27% when using the method of [12]. They also found promising results when predicting gaze locations given the action. However, when inferring both action and gaze location, action recognition accuracy only improved marginally (29%).

2.2.3 State-based Activity Detection 

Very recently, Fathi et al. proposed a new approach to model actions in egocentric videos [16], exploiting the fact that goal-oriented actions (“open coffee jar”) within object-manipulation activities (making coffee/sandwiches) can be detected by state changes of the objects involved. Thus, for training purposes, they annotated each action with start frame, end frame, action label, and a set of nouns describing the objects involved. Focusing only on foreground objects [6], they discovered regions that changed between the start and the end of the action and clustered them into regions that consistently appear during the action, in order to prune out irrelevant regions (such as hands). They then described those regions with color, texture, and shape features and trained a linear SVM to learn a state-specific region detector. The action itself was then described as a quantized response of the start and end frames to each region detector. With those responses, a second linear SVM was trained to build an action detector. They validated their model in terms of action recognition and activity segmentation, achieving a 39.7% action recognition accuracy (based on 61 action classes) and outperforming their previous work in [12]. They achieved a 33% accuracy for activity segmentation, based on the percentage of test video frames that had been labeled with the correct action.

2.2.4 Interaction and Sport Activities 

Ryoo and Matthies were recently the first to explore interaction-level human activities from a first-person view [17]. Motivated by surveillance, military, or general human-robot interaction scenarios, they constructed a dataset of humans directly interacting with the egocentric observer. Interactions varied from friendly (shaking hands or petting the observer) to hostile (punching the observer or throwing objects at the observer). Based on the idea that interaction with the observer causes a lot of ego-motion, they used a combination of global and local motion descriptors to describe different activities. For global motion, they applied a conventional pixel-wise optical flow algorithm and built a histogram based on locations and directions of the flow. For local motion, they interpreted the video as a 3-D XYT volume by concatenating frames over time and applied the cuboid feature detector of [18] to obtain video patches that contain salient motion. These motion descriptors were clustered using k-means to obtain a set of visual words. They represented an activity video as a histogram of these words and finally trained an SVM. Results were evaluated in terms of activity classification and detection, achieving an 89.6% classification accuracy (based on 7 different activities), as well as an average detection precision of 0.709.
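
The step from clustered motion descriptors to a per-video histogram of visual words can be sketched as follows. The sketch assumes the k-means codebook has already been learned and simply assigns each descriptor to its nearest cluster center; the names `quantize` and `bag_of_words` and the 2-D toy descriptors are hypothetical.

```python
import math

def quantize(descriptor, codebook):
    """Index of the codebook entry (visual word) closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))

def bag_of_words(descriptors, codebook):
    """Normalized histogram of visual-word occurrences for one video."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[quantize(d, codebook)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Hypothetical codebook with two visual words and four motion descriptors.
toy_codebook = [(0.0, 0.0), (1.0, 1.0)]
toy_descriptors = [(0.1, 0.0), (0.9, 1.0), (1.0, 1.2), (1.1, 0.9)]
toy_histogram = bag_of_words(toy_descriptors, toy_codebook)
```

The resulting fixed-length histogram is what would be passed to a classifier such as an SVM, regardless of how many descriptors a video produced.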

Kitani et al. [19] observed the increased usage of egocentric cameras in sport videos (biking, skiing, etc.). They developed a fast, unsupervised approach to index videos into different ego-actions, which is intended to help the athlete locate and review specific parts without the burden of manual search. Similar to [17], they leveraged the fact that first-person sport videos contain lots of ego-motion and used optical flow histograms to describe the motions of a specific sport video. As many sport activities contain periodic movements, they additionally performed a DFT on the optical flow amplitudes to obtain frequency histograms. They used a Dirichlet mixture model [20] to first infer a motion codebook and then infer ego-action categories. They evaluated their performance on controlled, choreographed videos as well as real-world sport videos obtained from YouTube and reported an F-measure (considering both precision and recall) for each sport. They achieved an F-measure of 0.93 for the choreographed videos and an average F-measure of 0.6 for the sport videos. Ego-actions varied between sports and involved labels such as “hop down”, “turn left”, or “wedge left” for skiing.
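
To illustrate why a DFT over flow amplitudes exposes periodic movement, the following sketch computes the magnitude spectrum of a one-dimensional amplitude signal. It is a plain textbook DFT, not the authors' feature pipeline; the signal is a hypothetical stand-in for the mean optical flow amplitude per frame.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of DFT bins 0 .. n/2 of a real signal, normalized by its length."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)
        )
        mags.append(abs(coeff) / n)
    return mags

def dominant_frequency_bin(signal):
    """Strongest non-constant frequency bin (bin 0, the mean, is skipped)."""
    mags = dft_magnitudes(signal)
    return max(range(1, len(mags)), key=lambda k: mags[k])

# Hypothetical flow-amplitude signal: a constant offset plus a motion
# that repeats exactly twice within the 16-frame window.
toy_signal = [1.0 + math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
toy_spectrum = dft_magnitudes(toy_signal)
toy_peak = dominant_frequency_bin(toy_signal)
```

A repetitive motion such as pedaling concentrates its energy in one frequency bin, so the spectrum separates it from an equally strong but irregular motion.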

2.3 Life Logging Video 

Another area that is of particular interest to the ubiquitous computing community and involves egocentric video is the idea of “life logging”. Here, a first-person camera continuously records a whole day of its wearer’s life. The overall motivation mentioned by many authors is to eventually develop systems that can serve as a retrospective memory aid for people with memory loss problems [21]. Thus, a common goal is to summarize long egocentric video or to detect novel, anomalous events.

2.3.1 Video Summarization 

Doherty et al. [22] were among the first to investigate keyframe selection methods in the egocentric domain by looking at the Microsoft SenseCam, a camera worn around the neck that takes an image every couple of seconds (an average of 1,900 images a day) to create a passively captured, visual life log. They pointed out that many of the established mechanisms for keyframe selection do not translate directly to the domain of life logging video, as they rely, for instance, on motion analysis, and motion is virtually non-existent due to the very low frame rate of the camera. Also, passive capture devices may not always capture high-quality images, and hands or clothing covering parts of the lens are quite common. First, the authors split the set of images into different events, where event boundaries are determined by high dissimilarity between frames according to a distance metric based on color and edge descriptors. They compared various approaches to select a keyframe for each of those events. Approaches ranged from very simple solutions, such as taking the middle image of the event or taking the image that is closest to the average value of all images in the event, to more complex solutions, such as taking the image that is closest to the event average, farthest from the average of other events, and performs well on various image quality tests for sharpness and contrast. Over 13,000 keyframes were judged by user ratings, where the most complex approach scored 8.4% higher than the baseline (middle frame). They found that issues mainly occur during events that include a lot of motion (such as walking home), as there may be vast differences between images of the same event due to the nature of the camera and its low frame rate.
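
Two of the steps described above, splitting an image stream into events and picking the image closest to the event average, can be sketched as follows. The sketch assumes each image is already summarized by a feature vector and uses plain Euclidean distance in place of the color and edge descriptor metric; the threshold, function names, and toy data are hypothetical.

```python
import math

def split_events(features, threshold):
    """Start a new event whenever consecutive images differ by more than threshold."""
    events, current = [], [0]
    for i in range(1, len(features)):
        if math.dist(features[i - 1], features[i]) > threshold:
            events.append(current)
            current = []
        current.append(i)
    events.append(current)
    return events

def keyframe_closest_to_mean(features, event):
    """Index of the image in the event whose features are closest to the event mean."""
    dim = len(features[0])
    mean = [sum(features[i][d] for i in event) / len(event) for d in range(dim)]
    return min(event, key=lambda i: math.dist(features[i], mean))

# Hypothetical 1-D image features with one clear scene change after image 2.
toy_features = [(0.0,), (0.1,), (0.3,), (5.0,), (5.2,), (5.3,)]
toy_events = split_events(toy_features, threshold=1.0)
toy_keyframes = [keyframe_closest_to_mean(toy_features, e) for e in toy_events]
```

The more complex selection strategies mentioned in the text would add further terms to the selection criterion, such as distance from other events' averages or an image quality score.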

Lee et al. devised a method that aims to summarize life logging video material and goes beyond common keyframe detection by focusing on “importance cues” specific to the egocentric domain, such as objects and people the camera wearer interacts with [23]. In particular, they segment each frame into multiple regions using a constrained parametric min-cuts method [24] and learn a regressor that predicts an importance score for each region. The score is based on a combination of various features: interaction (Euclidean distance of the region centroid to the hand centroid, where the hand is detected based on skin color), gaze (Euclidean distance to the frame center), frequency (appearance of the region over multiple frames based on DoG+SIFT descriptors), object-like appearance (based on a ranking function of [24]), object-like motion, and likelihood of a person’s face within a region (using the Viola-Jones method [25]). They then temporally clustered the video into different events based on color histogram differences and represented each event with the frame that has the highest importance score according to the regressor. For training and evaluation, they used Amazon’s Mechanical Turk to manually label and segment important regions in their video data, which consisted of multiple hours of daily life activities recorded by four different subjects. They evaluated the performance on classifying important regions correctly (by thresholding the regressor), as well as the quality of the keyframe summary. They found that their method performed better in predicting important objects than object-like features alone or low-level saliency methods. To quantify the perceived quality of the keyframe summaries, they asked the subjects who wore the camera to compare their method with baseline methods (such as uniform sampling among events), and their method was preferred 68.75% of the time.

Lu and Grauman [26] extended this work by developing a story-driven (rather than object-driven) approach to summarize egocentric life logging video. The idea is to devise an influence metric that captures event connectivity and accounts for how one event leads to another, in order to create a summary that provides a better sense of a story. They also introduced a novel temporal segmentation method to cluster the video material into different events, which was specifically designed for egocentric video. They found that the method based on changes in color histograms, which they used in previous work [23], does not work well for egocentric video due to its continuous nature. Instead, they tried to distinguish whether the camera wearer is static, in transit (physically traveling from one point to another), or moving the head. They learned an SVM to predict these scenarios based on dense optical flow features and blurriness scores [27]. They found that this method produced events (i.e. sets of frames) with an average length of 15 seconds. They represented each event in terms of detected objects. For known environments, objects were represented as scores based on a bank of object detectors, and for uncontrolled environments, objects were essentially visual words based on object-like windows [28]. They went on to consider each event as a node in a chain. Finding a story-driven summary consisting of K frames then comes down to finding the optimal, order-preserving K-node subchain with respect to story, importance, and diversity constraints. The importance score was estimated similarly to their previous work [23], the story constraint favored event pairs with similar object instances, and the diversity constraint made sure that sequential events are not too similar. They found a good chain with the approximate best-first search strategy described in [29]. They evaluated their performance in a user study based on their own dataset [23] as well as the “Activities of Daily Living” dataset from [10]. To do so, they had 34 subjects compare their approach with other techniques such as uniform sampling or their previous work [23]. They found that an average of 87% of the subjects preferred their approach across different datasets and baselines.
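
The notion of an order-preserving K-node subchain can be made concrete with a small sketch. Note the substitution: the text describes an approximate best-first search over a richer objective, whereas the code below uses exact dynamic programming over a simplified, hypothetical objective (a per-event importance plus a pairwise link score between consecutive selected events). All names and numbers are assumptions for illustration, and K must not exceed the number of events.

```python
def best_subchain(importance, link, k):
    """Select k events in temporal order, maximizing importance plus link scores.

    importance: list of per-event scores.
    link(a, b): score for selecting event b directly after event a (a < b).
    Returns (selected indices, total score).
    """
    n = len(importance)
    neg_inf = float("-inf")
    # best[j][i]: best score of a chain with j + 1 nodes that ends at event i.
    best = [[neg_inf] * n for _ in range(k)]
    back = [[None] * n for _ in range(k)]
    for i in range(n):
        best[0][i] = importance[i]
    for j in range(1, k):
        for i in range(j, n):
            for p in range(j - 1, i):
                cand = best[j - 1][p] + link(p, i) + importance[i]
                if cand > best[j][i]:
                    best[j][i], back[j][i] = cand, p
    end = max(range(k - 1, n), key=lambda i: best[k - 1][i])
    chain = [end]
    for j in range(k - 1, 0, -1):
        chain.append(back[j][chain[-1]])
    return chain[::-1], best[k - 1][end]

def adjacency_penalty(a, b):
    """Hypothetical diversity-style term: discourage picking neighboring events."""
    return -0.5 if b - a == 1 else 0.0

toy_importance = [1.0, 0.2, 0.9, 0.1, 0.8]
toy_chain, toy_score = best_subchain(toy_importance, adjacency_penalty, k=3)
```

Exact dynamic programming works here only because the toy objective decomposes over consecutive pairs; objectives that couple non-adjacent events would call for a search strategy instead.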
2.3.2 Novelty Detection

Aghazadeh et al. [30] looked at videos from a subject who recorded his one-hour commute to work multiple times, wearing an egocentric camera that captures one image per second. Motivated by the idea of using life logging cameras as a memory support system for the disabled [21], they proposed a method of novelty detection, where a novel event might be “meeting a friend” during the otherwise similar sequences of the subject going to work. They achieved this by exploiting the invariant temporal order of the activities across the different sequences to automatically align a query sequence with the other sequences. The idea is that a bad alignment indicates a novelty in the query sequence, as it is likely caused by an event that has not been observed in the reference sequences. They derived a similarity measure between two frames based on VLAD (vector of locally aggregated descriptors, proposed by [31]) as well as geometric similarity, represented by the epipolar geometry between the two frames (i.e. the fundamental matrix). Comparing each frame from the query sequence with each frame from a reference sequence creates a cost matrix whose minimum-cost path connecting the first and last frame (with the constraint that matches have to occur in temporal order) yields the best alignment between the two sequences. Finally, if a frame from the query sequence has a minimum match cost among all reference sequences that is above some threshold, it is considered a novelty. Of 31 sequences of the subject going to work, four contained an event that the authors considered novel, and all of them were detected by the algorithm.
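
The alignment and thresholding steps can be sketched with a standard dynamic-programming path search over the cost matrix. This is a generic minimum-cost monotonic alignment (in the style of dynamic time warping) and is only assumed to resemble the authors' procedure; the cost values, the threshold, and the function names are hypothetical.

```python
def align(cost):
    """Minimum-cost monotonic path from (0, 0) to the last cell of cost.

    cost[i][j] is the dissimilarity of query frame i and reference frame j.
    Returns the path as a list of (query index, reference index) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + prev
    # Walk back from the last cell, always moving to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]

def novel_frames(costs_per_reference, threshold):
    """Query frames whose best aligned match cost over all references exceeds threshold."""
    n = len(costs_per_reference[0])
    best = [float("inf")] * n
    for cost in costs_per_reference:
        for i, j in align(cost):
            best[i] = min(best[i], cost[i][j])
    return [i for i in range(n) if best[i] > threshold]

# Hypothetical costs: 4 query frames against 3 reference frames.
# Query frame 2 matches nothing in the reference sequence well.
toy_cost = [
    [0.1, 0.9, 0.9],
    [0.9, 0.1, 0.9],
    [0.8, 0.8, 0.8],
    [0.9, 0.9, 0.1],
]
toy_path = align(toy_cost)
toy_novel = novel_frames([toy_cost], threshold=0.5)
```

Taking the minimum over all reference sequences means a frame is only flagged if no earlier recording explains it, which keeps a single unusual reference commute from producing false alarms.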

2.3.3 Social Interactions 

Fathi et al. [32] looked at egocentric life logging video of social events, in particular people spending a day at an amusement park, and developed a method for the detection and recognition of social interactions. This was motivated by the idea that typically one or more individuals have to play the role of the “group videographer” to capture memorable events, which prevents them from fully participating in the group experience. Moreover, many memorable moments may occur spontaneously, and the authors’ thesis is that the presence or absence of social interactions is an important cue as to whether an event is viewed as memorable. The idea is that different kinds of social interactions can be detected and recognized by faces and their spatial attention. For instance, a monologue should have multiple observing faces attending to the talking face. To model this, they first computed the orientation of each detected face using the Pittpatt face detection software¹ and then used the camera’s intrinsic parameters, as well as prior knowledge of face sizes at certain distances, to estimate face locations and orientations in 3D. To estimate the locations that the faces are attending to, they built an MRF that incorporates these 3D locations/orientations as unary potentials, but also uses pairwise potentials between faces that bias nearby faces towards looking at the same location in the scene. They used an α-expansion method to optimize the MRF. Having an estimate of each face’s attention, they assigned roles to individual faces based on features such as the number of faces looking at x. Based on those, they could classify an interaction as dialogue, discussion, monologue, or other labels, using a Hidden Conditional Random Field that also incorporated temporal information. They reported results for attention estimation as well as social interaction detection and recognition. Based on about 1000 hand-labeled frames, their method correctly estimated who is looking at whom in 71.4% of the cases. For detection, they presented ROC curves for different forms of interaction, where the average area under the curve is 0.88. The average recognition accuracy was 55% (chance being 20%).


3 Datasets 

Figure 1 gives a compact overview of all datasets from the work mentioned in section 2 that are publicly available. We briefly describe the data as well as what kind of labeling is provided and also list the URLs of websites that contain further explanations and download links.

Most authors establish their own dataset, and consequently
none of the datasets has taken over the role of a true
benchmark dataset. An exception might be the “Intel 42
Objects” dataset for the task of egocentric object
recognition, which has also been used by [6] to

¹ Pittpatt has since been acquired by Google Inc. and the software is
no longer publicly available.
| Name | Description | Labeling | Used in | URL |
|---|---|---|---|---|
| Intel 42 Objects | 10 video sequences (100K frames) from two human subjects manipulating 42 everyday object instances such as coffeepots, sponges, or cameras | each frame labeled with name of object; exemplar photos of objects with foreground/background segmentation | | http://Seattle.intel-research. |
| GeorgiaTech Egocentric Activities (GTEA) | 7 types of daily activities such as making a sandwich/coffee/tea; each performed by 4 different subjects | each activity video is labeled with list of objects being involved; each frame has left hand, right hand, and background segmentation masks | | http://www.cc.gatech.edu/~afatl |
| CMU kitchen | multimodal dataset of 18 subjects cooking 5 different recipes (brownies, pizza, etc.); also contains audio, body motion capture, and IMU data | each frame is labeled with an action such as “take oil”, “crack egg”, etc. | | http://kitchen.cs.cmu.edu/ |
| Activities of Daily Living | 18 daily indoor activities such as brushing teeth, washing dishes, or watching television, each performed by 20 different subjects | 42 object classes that are involved in the activities are annotated with bounding boxes in all frames | | http://deepthought.ics.uci.edu |
| GeorgiaTech Egocentric Activities - Gaze+ | 7 types of meal preparation such as making pizza/pasta/salad; each performed by 5 different subjects | each frame has eye gaze fixation data; timeframes of different activities such as “open fridge” are annotated | | http://www.cc.gatech.edu/~afatl |
| UT Egocentric | 4 videos from head-mounted cameras capturing a person’s day, each about 3-5 hours long | not available | | http://vision.cs.utexas.edu/pre |
| First-Person Social Interactions | day-long videos of 8 subjects spending their day at Disney World | timeframes for different activities (“waiting”, “train ride”, etc.) and social interactions (dialogue, discussion, etc.) are annotated | | http://www.cc.gatech.edu/~afatl |

Figure 1: Overview of publicly available egocentric video datasets. Row one deals with object recognition. Rows 2-5 
deal with activity detection/recognition. Rows 6 and 7 deal with life logging video data. 
test the performance of their motion-based
foreground-background segmentation method. Further, the
“Activities of Daily Living” dataset was used by [26] to test
their story-driven video summarization method. However, as
this dataset was primarily collected for the task of activity
recognition, a direct comparison between both works
was not possible.

4 Summary and Comparison 

In this section, we summarize the key aspects of the work 
that was introduced in the previous sections and draw 
comparisons where possible. 

Ren and Philipose [1] were the first to test standard
recognition systems for the task of recognizing handled
objects in egocentric video. They went on to find that
foreground-background segmentation can successfully be
done with optical flow based approaches and helps to
improve the recognition results, as handled objects tend to
be in the foreground. Their segmentation method was
improved by Fathi et al., who were also the first to
consider multiple objects being manipulated as part of
kitchen activities like making sandwiches. Fathi et al.
went on to experiment with various weakly supervised
approaches to recognize such activities, including object
co-occurrence and changes in object states. They
are also the only group to experiment with the influence
of gaze on activity recognition. Pirsiavash and Ramanan
were successful at recognizing more versatile household
activities. However, unlike the work of Fathi et al., their
method is strongly supervised. Ryoo and Matthies started
looking at interaction level activities such as shaking
hands. They discovered that activities that contain a lot
of ego-motion can be well described with optical flow based
approaches. Kitani et al. came to similar conclusions when
looking at sport activities that also involve a lot of
ego-motion.
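To make the idea of optical flow based foreground-background segmentation concrete, the following minimal sketch treats the dominant motion in a frame as camera (ego) motion and marks everything that moves differently as foreground. It is an illustration only and not the method of any of the cited works: it assumes that a dense flow field has already been computed, that the background covers most of the frame, and that ego-motion is well approximated by a single translation. The function name and the threshold value are hypothetical.

```python
from statistics import median


def segment_foreground(flow, threshold=2.0):
    """Label each cell of a flow field as foreground (True) or background.

    flow is a 2D list of (dx, dy) displacement vectors, one per pixel or
    block. The median flow serves as a crude estimate of the camera
    motion; cells whose flow deviates from it by more than `threshold`
    (in pixels) are marked as foreground.
    """
    dxs = [dx for row in flow for dx, _ in row]
    dys = [dy for row in flow for _, dy in row]
    bg_dx, bg_dy = median(dxs), median(dys)
    return [
        [((dx - bg_dx) ** 2 + (dy - bg_dy) ** 2) ** 0.5 > threshold
         for dx, dy in row]
        for row in flow
    ]


# Toy example: the camera pans right, a handled object moves downwards.
flow = [[(3.0, 0.0)] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        flow[r][c] = (3.0, 5.0)

mask = segment_foreground(flow)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

A single translation is a strong simplification, since real head motion also involves rotation and parallax, so this should be read as a starting point rather than a usable segmentation method.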

In parallel, researchers started looking at egocentric
video for life logging purposes. Doherty et al. [22] were
the first to investigate keyframe selection methods in
egocentric video, finding that many of the established
methods to segment video into coherent parts do not work
well due to the continuous nature of the video data.
Follow-up work by Lee et al. as well as Lu and Grauman [26]
investigated important objects and people as features to
build better methods for keyframe extraction and
summarization of egocentric life logging video. In contrast,
Aghazadeh et al. looked at life logging video of one subject
over multiple days and detected novel or out of the ordinary
activities.
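The general idea of using important objects and people as cues for keyframe selection can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of any of the works above: it assumes that each frame already has a scalar importance score (for instance derived from detected objects or faces, which is not shown here), and it uses fixed, equally sized temporal segments instead of a learned segmentation. The function name and the scores are hypothetical.

```python
def select_keyframes(importance, num_segments):
    """Pick one keyframe per temporal segment.

    importance is a list with one score per frame. The frames are split
    into `num_segments` contiguous, roughly equal segments, and the index
    of the highest-scoring frame in each segment is returned.
    """
    n = len(importance)
    if n == 0 or num_segments <= 0:
        return []
    num_segments = min(num_segments, n)  # never create empty segments
    keyframes = []
    for s in range(num_segments):
        start = s * n // num_segments
        end = (s + 1) * n // num_segments
        best = max(range(start, end), key=lambda i: importance[i])
        keyframes.append(best)
    return keyframes


# Hypothetical per-frame importance scores for a six-frame clip.
scores = [0.1, 0.9, 0.2, 0.3, 0.8, 0.4]
print(select_keyframes(scores, 2))
```

Choosing one frame per segment keeps the summary spread over the whole recording, while the importance score decides which frame represents each segment.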

5 Conclusion 

In the previous sections, we gave a broad overview of the
different problems in the domain of egocentric video that
have recently been addressed in the computer vision
community. We showed that research can roughly be grouped
into three categories: object recognition, activity and
action detection, and life logging video summarization. All
work in this domain is at a very early stage: the first
publications on egocentric object recognition [1] and action
segmentation date back to the first (out of two) IEEE
workshop on egocentric vision during CVPR 2009. Early work
on egocentric video in life logging scenarios only dates
back to 2008 [22]. As one result of this, almost all
publications introduce their own, novel datasets, while
working with other authors’ data remains the exception.
Consequently, no dominant benchmark datasets have emerged
so far, as they have in other computer vision areas such as
general object recognition.

Despite the novel nature of the egocentric vision domain,
we can see some trends that span all research categories:
egocentric video is all about objects. In first-person
videos, objects of interest tend to be naturally centered
and at a large scale while being subject to relatively
little occlusion, which makes egocentric video very
convenient for object detection and classification.
Additionally, optical flow based methods seem to work very
well for the task of segmenting foreground objects (that are
manipulated by hands) from background noise and are used in
almost all recent publications to improve recognition
results. This object-centered idea extends to action and
activity recognition. Traditional work in this area (with
video from third-person cameras) usually involves approaches
that use body configurations and movements as main features
and try to detect, for instance, sport activities. In
contrast, activities that are interesting from an egocentric
perspective almost always involve objects that are being
manipulated, while body movements are of little help.
Consequently, almost all the work on activity recognition
presented in section 2 used object detection in some way.
Analogously, a lot of the work on life logging summarization
uses interacted objects as cues for interesting or
representative frames, resulting in better keyframes and
summarizations than commonly used methods.